KDGene: knowledge graph completion for disease gene prediction using interactional tensor decomposition

Abstract The accurate identification of disease-associated genes is crucial for understanding the molecular mechanisms underlying various diseases. Most current methods focus on constructing biological networks and utilizing machine learning, particularly deep learning, to identify disease genes. However, these methods overlook complex relations among entities in biological knowledge graphs. Such information has been successfully applied in other areas of life science research, demonstrating its effectiveness. Knowledge graph embedding methods can learn the semantic information of different relations within knowledge graphs. Nonetheless, the performance of existing representation learning techniques, when applied to domain-specific biological data, remains suboptimal. To solve these problems, we construct a biological knowledge graph centered on diseases and genes, and develop KDGene, an end-to-end knowledge graph completion framework for disease gene prediction using interactional tensor decomposition. KDGene incorporates an interaction module that bridges entity and relation embeddings within tensor decomposition, aiming to improve the representation of semantically similar concepts in specific domains and enhance the ability to accurately predict disease genes. Experimental results show that KDGene significantly outperforms state-of-the-art algorithms, whether existing disease gene prediction methods or knowledge graph embedding methods for general domains. Moreover, a comprehensive biological analysis of the predicted results further validates KDGene’s capability to accurately identify new candidate genes. This work proposes a scalable knowledge graph completion framework to identify disease candidate genes, whose results are promising as references for further wet-lab experiments. Data and source code are available at https://github.com/2020MEAI/KDGene.


Section II. Results of different hyper-parameter settings
The main hyper-parameters of KDGene are the embedding dimensions (for both entities and relations), batch size, learning rate, and regularization coefficient.
To determine the optimal embedding dimensions for entities and relations, we explored three values for each (1000, 1500, and 2000), resulting in nine distinct combinations. Among these, 2000 dimensions for entities and 1500 for relations offered the best performance. The performance differences across combinations were not substantial, however. This suggests that, while our chosen configuration is effective, users may select alternative dimension settings according to the memory capacity of their systems without a large loss in performance.
In our evaluation, we experimented with batch sizes of 128, 256, 512, and 1024. The aggregated results suggest that a batch size of 512 yields the best overall performance. However, practical considerations led us to adopt a batch size of 1024 for our final model training: smaller batch sizes, while potentially yielding slightly better results, increase training duration. The larger batch size therefore strikes a balance between computational efficiency and performance. Users are encouraged to adjust the batch size according to their own computational constraints and training time requirements.
When analyzing the effect of the learning rate on our model, we considered values of 0.01, 0.03, 0.05, and 0.1. We selected 0.05 because it achieved the best performance across our evaluations. However, the experimental results indicate that, for the DisGeNet dataset specifically, a learning rate of 0.1 is also a feasible choice and provides satisfactory outcomes. Higher learning rates can lead to faster convergence but also carry the risk of overshooting the optimal solution. This trade-off is particularly important when adapting our model to other datasets, where the optimal learning rate may differ.
To select the regularization coefficient, we considered values of 0, 0.001, 0.01, and 0.1. The performance metrics clearly favored a coefficient of 0.1, indicating that this level of regularization mitigates overfitting without overly constraining the model's capacity to learn from the data. A coefficient of 0, which corresponds to no regularization, leads to overfitting, as suggested by the lower performance metrics. This pattern aligns with the outcomes observed for our baseline model CP-N3 [1] when applied to general knowledge graphs. The consistency of these results reinforces the importance of regularization for tensor decomposition-based models, where the right amount of regularization is crucial for generalization and robust performance.
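The search space described in this section can be written down compactly. The following Python sketch only restates the candidate and selected values reported above; the dictionary keys and variable names are illustrative assumptions and are not taken from the KDGene code base.

```python
from itertools import product

# Candidate values reported in this section (key names are hypothetical).
SEARCH_SPACE = {
    "entity_dim": [1000, 1500, 2000],
    "relation_dim": [1000, 1500, 2000],
    "batch_size": [128, 256, 512, 1024],
    "learning_rate": [0.01, 0.03, 0.05, 0.1],
    "reg_coefficient": [0, 0.001, 0.01, 0.1],
}

# Values selected for the final model, as stated in the text.
SELECTED = {
    "entity_dim": 2000,
    "relation_dim": 1500,
    "batch_size": 1024,  # 512 scored best; 1024 was chosen for faster training
    "learning_rate": 0.05,
    "reg_coefficient": 0.1,
}

# The nine (entity_dim, relation_dim) combinations that were compared.
dim_combinations = list(
    product(SEARCH_SPACE["entity_dim"], SEARCH_SPACE["relation_dim"])
)

if __name__ == "__main__":
    print(len(dim_combinations), "dimension combinations")
    for name, value in SELECTED.items():
        print(f"{name}: {value} (searched over {SEARCH_SPACE[name]})")
```

Keeping the candidates and the final choices side by side makes it easy to swap in smaller dimensions or a different batch size when memory or training time is the limiting factor.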

Section III. The computational cost and complexity of KDGene
We analyzed the computational cost and complexity of KDGene from the following three aspects: the scale of the knowledge graph, model parameters, and experimental details.

1) The scale of the knowledge graph
Our method introduces additional entities and relations, which naturally increases the number of model parameters. This increase scales with the size of the biological knowledge graph (KG). For example, incorporating disease-symptom relationships into our KG significantly enhanced model performance at the cost of only 11,607 additional entities. In addition, each relation type is represented by a 1500-dimensional vector under our current hyper-parameter setting. Our KG currently contains 7 relation types, so only 7 × 1500 = 10,500 values need to be stored for the relation embeddings. In other words, the relational representation does not significantly increase memory overhead.
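To make these figures concrete, the following Python sketch estimates the storage needed for the relation embeddings and for the embeddings of the 11,607 added entities. It assumes 32-bit floating-point parameters, the 2000-dimensional entity setting, and a single embedding table per entity; none of these storage details are stated in the text, and the helper name is hypothetical. If a model keeps separate head and tail entity tables, as CP-style decompositions commonly do, the entity figure doubles.

```python
BYTES_PER_FLOAT32 = 4  # assumption: parameters are stored as 32-bit floats


def embedding_table_size(num_rows, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Return (parameter count, size in bytes) of a num_rows x dim table."""
    params = num_rows * dim
    return params, params * bytes_per_value


# 7 relation types, each a 1500-dimensional vector.
relation_params, relation_bytes = embedding_table_size(7, 1500)

# 11,607 added entities, assuming one 2000-dimensional vector per entity.
entity_params, entity_bytes = embedding_table_size(11607, 2000)

if __name__ == "__main__":
    print(f"relations: {relation_params} params, {relation_bytes / 1e3:.1f} kB")
    print(f"added entities: {entity_params} params, {entity_bytes / 1e6:.1f} MB")
```

Under these assumptions the relation embeddings occupy about 42 kB, which is negligible next to the roughly 93 MB taken by the added entity embeddings.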

2) Model parameters of KDGene
The interaction module, unique to KDGene, is a recurrent neural network unit that facilitates effective learning of representations in vertical-domain knowledge graphs through a straightforward yet powerful mechanism. The module is implemented using an LSTMCell, which introduces the additional parameters inherent to its design, including the gates that regulate the flow of information (input, output, and forget gates) and the cell state. The size of the module follows from our selected hyper-parameters, namely an entity dimension of 2000 and a relation dimension of 1500. While this increases the model's parameter count, the count remains within manageable limits and can be adjusted to the computational resources available to the user. To assist researchers in choosing these parameters, we have included a comprehensive analysis of hyper-parameter tuning in our supplementary material, enabling the model to be tuned to fit given computational constraints.
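The parameter overhead of an LSTM cell can be estimated with simple arithmetic. The sketch below counts parameters using the layout of PyTorch's `torch.nn.LSTMCell` (four gate blocks, each with input-to-hidden and hidden-to-hidden weights, plus two bias vectors per block). Which embedding serves as the cell input and which as the hidden state is an assumption made here for illustration, not a detail given in the text; the function name is likewise hypothetical.

```python
def lstm_cell_param_count(input_size, hidden_size, bias=True):
    """Number of trainable parameters in a standard LSTM cell.

    The cell has four blocks (input gate, forget gate, cell candidate,
    output gate), each with an input-to-hidden and a hidden-to-hidden
    weight matrix. With bias=True, two bias vectors are kept per block,
    as in torch.nn.LSTMCell.
    """
    weights = 4 * hidden_size * (input_size + hidden_size)
    biases = 8 * hidden_size if bias else 0
    return weights + biases


# Assumption: the relation embedding (1500) is the cell input and the
# entity embedding (2000) is the hidden state.
n_params = lstm_cell_param_count(input_size=1500, hidden_size=2000)

# The opposite assignment, for comparison.
n_params_swapped = lstm_cell_param_count(input_size=2000, hidden_size=1500)

if __name__ == "__main__":
    print(n_params, n_params_swapped)
```

Either assignment gives a cell in the tens of millions of parameters. Unlike the entity embedding tables, this cost is fixed and does not grow with the size of the knowledge graph.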

3) Experimental details
All experiments were conducted in PyTorch on GeForce GTX 2080Ti GPUs. We implemented optimization strategies, including efficient memory management and parallel processing, to keep the increased computational load manageable. The additional training time is counterbalanced by a performance improvement of over 30%, a gain in predictive accuracy that far outweighs the modest increase in computational effort.
The average training time of KDGene was 86.75 seconds per epoch, with early stopping usually occurring at around 135 epochs. By comparison, our baseline model CP-N3 recorded an average training time of 67.21 seconds per epoch across ten folds, with an average of 150 epochs until completion. Although each KDGene epoch is slower, training stops earlier, so the total training time (seconds per epoch × number of epochs) of our model is on average 16.17% higher than that of CP-N3.
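The 16.17% figure can be reproduced from the numbers above by multiplying the time per epoch by the number of epochs for each model. The following Python sketch performs this arithmetic; the function and variable names are illustrative, and the averages are treated as exact values.

```python
def total_training_time(seconds_per_epoch, epochs):
    """Total training time in seconds for a fixed per-epoch cost."""
    return seconds_per_epoch * epochs


kdgene_total = total_training_time(86.75, 135)  # KDGene: early stop near 135
cpn3_total = total_training_time(67.21, 150)    # CP-N3: about 150 epochs

# Relative increases of KDGene over CP-N3, in percent.
per_epoch_increase = (86.75 / 67.21 - 1) * 100
total_increase = (kdgene_total / cpn3_total - 1) * 100

if __name__ == "__main__":
    print(f"per epoch: +{per_epoch_increase:.2f}%")
    print(f"total:     +{total_increase:.2f}%")
```

The per-epoch cost rises by about 29%, but because KDGene stops roughly 15 epochs earlier, the total rises by only about 16.17% (about 11,711 s versus 10,082 s).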